Chapter 5 Parts-of-Speech Tagging
In many textual analyses, word classes can give us additional information about the texts we analyze. These word classes are typically referred to as the parts-of-speech (POS) tags of the words. In this chapter, we will show you how to POS-tag a raw-text corpus to get the syntactic categories of words, and what to do with those POS tags.
In particular, I will introduce a powerful package, spacyr, which is an R wrapper to spaCy, the “industrial-strength natural language processing” Python library from https://spacy.io. In addition to POS tagging, the package provides other linguistically relevant annotations for more in-depth analysis of English texts.
Note that spaCy is optimized for many languages, but not for Chinese. We will talk about Chinese text processing in a later chapter.
5.1 Installing the Package
Please consult the spacyr github for more instructions on installing the package.
There are at least four steps:
- Install miniconda (or any other conda version for Python)
- Install the spacyr R package
- Because spacyr is an R wrapper to the Python package spaCy, we also need to install the Python module (and the language model files).
The easiest way to install Python spaCy is to install it in RStudio through the R function spacyr::spacy_install(). This function by default creates a new conda environment called spacy_condaenv, as long as some version of conda has been installed on the user’s system.
Please also note that spacyr uses Python 3.6.x and spaCy 2.2.3+.
The spacy_install() will create a stand-alone conda environment including a python executable separate from your system Python (or anaconda python), install the latest version of spaCy (and its required packages), and download the English language model.
Step 1 is very important. If you don’t have any conda version installed on your system, you can install miniconda from https://conda.io/miniconda.html (choose the 64-bit version). Also, spacy_install() will automatically install miniconda (if there’s no conda installed on the system) for Mac users.
Windows users may need to consult the spacyr github for more important instructions on installation.
For Windows, you need to run RStudio as an administrator to make installation work properly. To do so, right click the RStudio icon (or R desktop icon) and select “Run as administrator” when launching RStudio.
- Restart R and Initialize spaCy in R
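The steps above can be sketched as a short R session. This is a minimal sketch only: the model name en_core_web_sm is an assumption here (it is spaCy’s small English model; the default your spacyr version downloads may differ).

```r
## One-time installation (run once):
# install.packages("spacyr")   # get the R package from CRAN
# spacyr::spacy_install()      # create the spacy_condaenv conda environment
#                              # and download the English language model

## Per-session initialization (after restarting R):
library(spacyr)
spacy_initialize(model = "en_core_web_sm")  # start the background Python process
```

Once spacy_initialize() succeeds, all subsequent spacy_parse() calls in the session reuse the same background Python process.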
5.2 Quick Overview
The spacyr provides a useful function, spacy_parse(), which allows us to parse an English text in a very convenient way.
txt <- c(d1 = "spaCy is great at fast natural language processing.",
d2 = "Mr. Smith spent two years in North Carolina.")
parsedtxt <- spacy_parse(txt,
pos = T,
tag = T,
lemma = T,
entity = T,
dependency = T)
parsedtxt
The output parsedtxt is a data frame, which includes annotations of the original texts at multiple granularities.
- All texts have been tokenized into words, with each word, sentence, and text given a unique ID (i.e., doc_id, sentence_id, token_id)
- Lemmatization is also done (i.e., lemma)
- POS tags can also be found (i.e., pos and tag)
  - pos: this column uses the Universal tagset for parts-of-speech, a general POS scheme that suffices for most needs and provides equivalencies across languages
  - tag: this column provides a more detailed tagset, defined in each spaCy language model. For English, this is the OntoNotes 5 version of the Penn Treebank tag set (cf. Penn Treebank Tagset)
- Depending on the argument settings for spacy_parse(), you can get more annotations, such as named entities (entity) and dependency relations (dep_rel)
5.3 Working Pipeline
In Chapter 4, we provided a primitive working pipeline for text analytics. Here we would like to revise the workflow to satisfy different goals in computational text analytics (see Figure 5.1).
After we secure a collection of raw texts as our corpus, if we do not need additional parts-of-speech information, we follow the workflow on the right.
If we need additional annotations from spacyr, we follow the workflow on the left.
Figure 5.1: English Text Analytics Flowchart
5.4 Parsing Your Texts
Now let’s use this spacy_parse() to analyze the presidential addresses we’ve seen in Chapter 4: the data_corpus_inaugural from quanteda.
To illustrate the annotation more clearly, let’s parse the first text in data_corpus_inaugural:
We can parse the whole corpus collection as well: we first apply the spacy_parse to each text in data_corpus_inaugural using map() and then rbind() individual resulting data frames into one using do.call().
system.time(
  corp_us_words <- data_corpus_inaugural %>%
    map(spacy_parse, tag = T) %>%  # purrr::map()
    do.call(rbind, .))             # or dplyr::bind_rows()
##    user  system elapsed
##  23.464   0.523  24.002
The function system.time() is a useful function which gives you the CPU time used by the expression in its parentheses. In other words, you can put any R expression inside the parentheses of system.time() as its argument and measure the time required by the expression.
This is sometimes necessary because some data-processing steps can be very time-consuming, and we would like to know HOW time-consuming they are in case we need to run the procedure again.
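As a self-contained illustration (the computation wrapped here is arbitrary; any expression works):

```r
# Wrap any expression in system.time() to measure how long it takes.
timing <- system.time({
  total <- sum(sqrt(seq_len(1e6)))  # some arbitrary, non-trivial computation
})
timing["elapsed"]  # wall-clock seconds used by the expression
```

The "elapsed" component is usually the one you care about; "user" and "system" break the CPU time down further.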
Before we move on, we need to clean up the doc_id column of corp_us_words: we somehow lost the document IDs when we used map().
Now the document ID information is in the row names of corp_us_words, so we retrieve the document filenames from the row names as the doc_id.
corp_us_words <- corp_us_words %>%
  mutate(doc_id = str_replace(row.names(.), "\\.\\d+$", ""))
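The same regex cleanup can be illustrated with base R’s sub() on a few hypothetical row names (the .1, .25, .3 suffixes stand in for the numeric counters added when the per-document data frames were row-bound):

```r
# Hypothetical row names: the document filename plus a numeric suffix.
rn <- c("1789-Washington.1", "1789-Washington.25", "1793-Washington.3")
doc_id <- sub("\\.\\d+$", "", rn)  # strip the trailing ".<digits>"
doc_id
# "1789-Washington" "1789-Washington" "1793-Washington"
```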
corp_us_words
Exercise 5.1 Can you combine the word-based annotations in corp_us_words into a sentence-based data frame as provided below? (You may name the sentence-based data frame corp_us_sents.)
5.5 Metalinguistic Analysis
Now spacy_parse() has enriched our corpus data with more linguistic annotations. We can now utilize the additional POS tags for more analysis.
In many applied linguistics studies, researchers look at the syntactic complexity of language across a particular factor. For example, they may look at the syntactic complexity development of L2 learners of varying proficiency levels, of L1 speakers at different acquisition stages, or of writers in different genres (e.g., academic vs. non-academic).
To operationalize the construct of syntactic complexity, we use a simple metric, Fichtner's C, which is defined as:
\[ Fichtner's\;C = \frac{Number\;of\;Verbs}{Number\;of\;Sentences} \times \frac{Number\;of\;Words}{Number\;of\;Sentences} \]
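A quick numeric check of the formula (the counts below are made up for illustration):

```r
# Fichtner's C = (verbs / sentences) * (words / sentences)
fichtners_c <- function(verb_num, word_num, sent_num) {
  (verb_num / sent_num) * (word_num / sent_num)
}
fichtners_c(verb_num = 20, word_num = 150, sent_num = 10)
# 30: on average 2 verbs and 15 words per sentence
```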
Now we can take the corp_us_words and first generate the frequencies of verbs, and number of words for each presidential speech text.
syn_com <- corp_us_words %>%
group_by(doc_id) %>%
summarize(verb_num = sum(pos=="VERB"),
sent_num = max(sentence_id),
word_num = n()) %>%
mutate(F_C = (verb_num/sent_num)*(word_num/sent_num)) %>%
ungroup
syn_com
With the syntactic complexity score of each president, we can plot the tendency:
syn_com %>%
ggplot(aes(x = doc_id, y = F_C, fill = doc_id)) +
geom_col() +
theme(axis.text.x = element_text(angle=90)) +
labs(title = "Syntactic Complexity", x = "Presidents", y = "Fichtner's C") +
guides(fill = F)
It’s interesting to see a decreasing trend in syntactic complexity!

5.6 Construction Analysis
Now with parts-of-speech tags, we are able to look at more linguistic patterns or constructions in detail. These POS tags allow us to extract more precisely the target patterns we are interested in.
In this section, we will use the output from Exercise 5.1. We assume that now we have a sentence-based corpus data frame, corp_us_sents. Here I like to provide a case study on English Preposition Phrases.
## ######################################
## If you haven't finished the exercise,
## the dataset is also available in
## `demo_data/corp_us_sents.RDS`
## ######################################
## Uncomment this line if you don't have `corp_us_sents`
# corp_us_sents <- readRDS("demo_data/corp_us_sents.RDS")
corp_us_sents
We can utilize regular expressions to extract PREP + NOUN combinations from the corpus data.
# define regex patterns
pattern_pat1 <- "[^/ ]+/ADP [^/]+/NOUN"
# extract patterns from corp
corp_us_sents %>%
unnest_tokens(output = pat_pp,
input = sentence_tag,
token = function(x) str_extract_all(x, pattern=pattern_pat1)) -> result_pat1
result_pat1
In the above example, we specify the token= argument in unnest_tokens(..., token = ...) with a self-defined function. The idea of tokenization in unnest_tokens() is that the token argument should be a function which takes a text-based vector as input (i.e., each element of the input vector may be a document text) and returns a list, each element of which is a token-based version (i.e., a vector) of the corresponding element of the input (cf. Figure 5.2).
Figure 5.2: Intuition for token= in unnest_tokens()
In our demonstration, we define a tokenization function which takes sentence_tag as the input and returns a list, each element of which consists of a vector of tokens matching the regular expression in the individual sentences of sentence_tag. (Note: the function object is not assigned to a name, so it is never kept in the R working session.)
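The contract of such a tokenizer can be mimicked in base R with gregexpr()/regmatches(). The tagged sentences below are toy examples (note that the case of the tags must match the pattern):

```r
pattern_pat1 <- "[^/ ]+/ADP [^/]+/NOUN"

# A tokenizer for unnest_tokens(): character vector in, list of match vectors out.
extract_pp <- function(x) regmatches(x, gregexpr(pattern_pat1, x))

sents <- c("She/PRON works/VERB in/ADP finance/NOUN ./PUNCT",
           "We/PRON met/VERB at/ADP noon/NOUN yesterday/ADV ./PUNCT")
extract_pp(sents)
# a list of two elements: "in/ADP finance/NOUN" and "at/ADP noon/NOUN"
```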
As an exercise, you may create a new column, pat_clean, with all annotations removed in the data frame result_pat1.
With these constructional tokens of English PP’s, we can then do further analysis.
- We first identify the PREP and NOUN for each constructional token.
- We then clean up the data by removing POS annotations.
# extract the prep and head
result_pat1 %>%
tidyr::separate(col="pat_pp", into=c("PREP","NOUN"), sep="\\s+" ) %>%
mutate(PREP = str_replace_all(PREP, "/[^ ]+",""),
NOUN = str_replace_all(NOUN, "/[^ ]+","")) -> result_pat1a
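These two cleanup steps can be illustrated on a single hypothetical token in base R (the chunk above uses tidyr::separate() and stringr instead):

```r
tok <- "of/ADP country/NOUN"
parts <- strsplit(tok, "\\s+")[[1]]    # split into the two words
prep  <- sub("/[^ ]+$", "", parts[1])  # drop the "/TAG" annotation
noun  <- sub("/[^ ]+$", "", parts[2])
c(PREP = prep, NOUN = noun)
# PREP = "of", NOUN = "country"
```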
result_pat1a
Now we are ready to explore the text data.
- We can look at how each preposition is being used by different presidents:
- We can examine the most frequent NOUN that co-occurs with each PREP:
# Most freq NOUN for each PREP
result_pat1a %>%
count(PREP, NOUN) %>%
group_by(PREP) %>%
top_n(1,n) %>%
arrange(desc(n))
- We can also look at a more complex usage pattern: how does each president use the PREP of in terms of its co-occurring NOUNs?
# NOUNS for `of` uses across different presidents
result_pat1a %>%
filter(PREP == "of") %>%
count(doc_id, PREP, NOUN) %>%
tidyr::pivot_wider(
id_cols = c("doc_id"),
names_from = "NOUN",
values_from = "n",
values_fill = list(n=0))
As an exercise, try to extract full English PPs from corp_us_sents. Specifically, we can define an English PP as a sequence of words which starts with a preposition and ends at the first word after the preposition that is tagged as NOUN, PROPN, or PRON.
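The long-to-wide reshaping performed by pivot_wider() above can also be sketched with base R’s xtabs() on a toy count table (the documents d1, d2 and the nouns here are made up):

```r
d <- data.frame(doc_id = c("d1", "d1", "d2"),
                NOUN   = c("people", "union", "people"))
# Cross-tabulate: one row per doc_id, one column per NOUN, zeros filled in.
m <- xtabs(~ doc_id + NOUN, data = d)
m
```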
5.7 Issues on Pattern Retrieval
Any automatic pattern retrieval comes with a price: there are always errors returned by the system.
I would like to discuss this issue based on the second text, 1793-Washington. First let's take a look at the preposition phrases extracted by the regular expression used in Exercises 5.4 and 5.5:
## ######################################
## If you haven't finished the exercise,
## the dataset is also available in
## `demo_data/result_pat2a.RDS`
## ######################################
## Uncomment this line if you don't have `result_pat2a`
# result_pat2a <- readRDS("demo_data/result_pat2a.RDS")
result_pat2a %>%
  filter(doc_id == "1793-Washington")
My regular expression has identified 20 PPs from the text. However, if we go through the text carefully and do the PP annotation manually, we may have different results.
Figure 5.3: Manual Annotation of English PP’s in 1793-Washington
There are two types of errors:
- False Positives: patterns identified by the system that are in fact not true patterns.
- False Negatives: true patterns in the data that are not successfully identified by the system.
As shown in Figure 5.3, manual annotations have identified 21 PP’s from the text while the regular expression identified 20 tokens. A comparison of the two results shows that:
- In the regex result, the following returned tokens (rows highlighted in red) are False Positives: the regular expression identified them as PPs, but in fact they were NOT PPs according to the manual annotations.
| doc_id | sentence_id | PREP | NOUN | pat_pp | row_id |
|---|---|---|---|---|---|
| 1793-Washington | 1 | by | voice | by/adp the/det voice/noun | 1 |
| 1793-Washington | 1 | of | country | of/adp my/det country/noun | 2 |
| 1793-Washington | 1 | of | chief | of/adp its/det chief/propn | 3 |
| 1793-Washington | 2 | for | it | for/adp it/pron | 4 |
| 1793-Washington | 2 | of | honor | of/adp this/det distinguished/adj honor/noun | 5 |
| 1793-Washington | 2 | of | confidence | of/adp the/det confidence/noun | 6 |
| 1793-Washington | 2 | in | me | in/adp me/pron | 7 |
| 1793-Washington | 2 | by | people | by/adp the/det people/noun | 8 |
| 1793-Washington | 2 | of | united | of/adp united/propn | 9 |
| 1793-Washington | 3 | to | execution | to/adp the/det execution/noun | 10 |
| 1793-Washington | 3 | of | act | of/adp any/det official/adj act/noun | 11 |
| 1793-Washington | 3 | of | president | of/adp the/det president/propn | 12 |
| 1793-Washington | 3 | of | office | of/adp office/noun | 13 |
| 1793-Washington | 4 | in | presence | in/adp your/det presence/noun | 14 |
| 1793-Washington | 4 | during | administration | during/adp my/det administration/noun | 15 |
| 1793-Washington | 4 | of | government | of/adp the/det government/propn | 16 |
| 1793-Washington | 4 | in | instance | in/adp any/det instance/noun | 17 |
| 1793-Washington | 5 | to | upbraidings | to/adp the/det upbraidings/noun | 18 |
| 1793-Washington | 5 | of | who | of/adp all/det who/pron | 19 |
| 1793-Washington | 5 | of | ceremony | of/adp the/det present/adj solemn/adj ceremony/noun | 20 |
- In the above manual annotation (Figure 5.3), phrases highlighted in red are NOT successfully identified by the current regex query, i.e., False Negatives.
We can summarize the pattern retrieval results as:

Most importantly, we can describe the quality of the pattern retrieval with two important measures.
- \(Precision = \frac{True\;Positives}{True\;Positives + False\;Positives}\)
- \(Recall = \frac{True\;Positives}{True\;Positives + False\;Negatives}\)
In our case:
- \(Precision = \frac{18}{18+2} = 90\%\)
- \(Recall = \frac{18}{18+3} = 85.71\%\)
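We can verify these figures with a few lines of R:

```r
tp <- 18; fp <- 2; fn <- 3  # counts from the 1793-Washington comparison
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
round(c(precision = precision, recall = recall) * 100, 2)
# precision = 90.00, recall = 85.71
```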
It is always very difficult to reach 100% precision or 100% recall in automatic retrieval of target patterns. Researchers often need to make a compromise. The following are some heuristics based on my experience:
- For small datasets, probably manual annotations give the best result.
- For moderate-sized dataset, semi-automatic annotations may help. Do the automatic annotations first and follow up with manual checkups.
- For large datasets, automatic annotations are preferred in order to examine the general tendency. However, it is always good to have a random sample of the data to check the query performance.
- The more semantics-related the annotations, the more likely one would adopt a manual approach to annotation (e.g., conceptual metaphors, sense distinctions, dialogue acts).
- Common annotations of corpus data may prefer an automatic approach, such as Chinese word segmentation, POS tagging, named entity recognition, chunking, noun-phrase extraction, or dependency relations.
In medicine, there are two similar metrics used for the assessment of the diagnostic medical tests—sensitivity (靈敏度) and specificity (特異性).
- Sensitivity refers to the proportion of true positives that are correctly identified by the medical test. This is indeed the recall rate we introduced earlier.
- Specificity refers to the proportion of true negatives that are correctly identified by the medical test. It is computed as follows:
  \(Specificity = \frac{True\;Negatives}{False\;Positives + True\;Negatives}\)
In plain English, the sensitivity of a medical test indicates the percentage of sick people who are correctly identified as having the disease; the specificity of a medical test indicates the percentage of healthy people who are correctly identified as healthy (i.e., not having the disease).
It should be obvious which metric is more crucial to the control of a pandemic.
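Both metrics can be computed from a 2-by-2 confusion table. The true-negative count below (tn = 77) is an arbitrary illustration, since true negatives are usually not enumerated in pattern retrieval:

```r
sens_spec <- function(tp, fp, fn, tn) {
  c(sensitivity = tp / (tp + fn),  # = recall
    specificity = tn / (fp + tn))
}
round(sens_spec(tp = 18, fp = 2, fn = 3, tn = 77), 4)
# sensitivity = 0.8571, specificity = 0.9747
```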
5.8 Saving POS-tagged Texts
We may very often come back to our corpus texts again and again as we explore the data. In order NOT to re-tag the texts every time we analyze the data, it is more convenient to save the tokenized texts with their POS tags in external files. Next time we can load these files directly without going through the POS tagging again.
However, when saving the POS-tagged results to an external file, it is highly recommended to keep all the tokens of the original texts. That is, leave all the word tokens as well as the non-word tokens intact.
A few suggestions:
- If you are dealing with a small corpus, I would suggest saving the resulting data frame from spacy_parse() as a csv for later use.
- If you are dealing with a big corpus, I would suggest saving the parsed output of each text file in a separate csv for later use.
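A minimal round-trip sketch in base R; the data frame here is a toy stand-in for spacy_parse() output:

```r
parsed <- data.frame(doc_id = "d1", token_id = 1:2,
                     token = c("spaCy", "is"), pos = c("PROPN", "AUX"))
f <- tempfile(fileext = ".csv")
write.csv(parsed, f, row.names = FALSE)  # save the tagged tokens
parsed2 <- read.csv(f)                   # reload later without re-tagging
identical(dim(parsed), dim(parsed2))     # TRUE
```

saveRDS()/readRDS() is an equally good choice and additionally preserves column types.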
5.9 Finalize spaCy
While running spaCy on Python through R, a Python process is always running in the background, and the R session will take up a lot of memory (typically over 1.5GB).
spacy_finalize() terminates the Python process and frees up the memory it was using.
Exercise 5.6 In this exercise, please use the corpus data provided in quanteda.textmodels::data_corpus_moviereviews. This dataset is provided as a corpus object in the package quanteda.textmodels (please install the package on your own). The data_corpus_moviereviews includes 2,000 movie reviews.
- Please use spacyr to parse the texts and provide the top 20 adjectives for positive and negative reviews respectively. Adjectives are naively defined as any words whose POS tags start with "J" (please use the fine-grained version of the POS tags, i.e., tag, from spacyr). When computing the word frequencies, please use the lemmas instead of the word forms.
- Please provide the top 20 content words for positive and negative reviews, ranked by a weighted score computed using the formula provided below. Content words are naively defined as any words whose POS tags start with N, V, or J.
\[Word\;Frequency \times log(\frac{Number\;of\;Documents}{Word\;Dispersion}) \]
- For example, the lemma action occurs 691 times in the negative reviews collection, and these occurrences are scattered across 337 different documents. There are 1,000 negative texts in the current corpus. The weighted score for action is then:
\[691 \times log(\frac{1000}{337}) = 751.58 \]
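The arithmetic checks out in R (note that R’s log() is the natural logarithm, consistent with the figure above):

```r
tf <- 691; df <- 337; N <- 1000
round(tf * log(N / df), 2)  # the weighted score for "action"
# 751.58
```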
In our earlier chapters, we have discussed the issues of word frequencies and their significance in relation to the dispersion of the words in the entire corpus. In terms of identifying important words from a text collection, our assumption is that: if a word is scattered in almost every document in the corpus collection, it is probably less informative. For example, words like a, the would probably be observed in every document in the corpus. Therefore, the high frequencies of these widely-dispersed words may not be as important compared to the high frequencies of those which occur in only a subset of the corpus collection. The word frequency is sometimes referred to as term frequency (tf) in information retrieval; the dispersion of the word is referred to as document frequency (df). In information retrieval, people often use a weighting scheme for word frequencies in order to extract informative words from the text collection. The scheme is as follows:
\[tf \times log(\frac{N}{df}) \]
N refers to the total number of documents in the corpus. The \(log\frac{N}{df}\) is referred to as the inverse document frequency (idf). This tf.idf weighting scheme is popular in many practical applications.
The smaller the df of a word, the higher its idf, and the larger the weight for its tf.